Opiates for the Matches: Matching Methods for Causal Inference

Author

  • Jasjeet S. Sekhon
Abstract

In recent years, there has been a burst of innovative work on methods for estimating causal effects using observational data. Much of this work has extended and brought a renewed focus on old approaches such as matching, which is the focus of this review. The new developments highlight an old tension in the social sciences: a focus on research design versus a focus on quantitative models. This realization, along with the renewed interest in field experiments, has marked the return of foundational questions as opposed to a fascination with the latest estimator. I use studies of get-out-the-vote interventions to exemplify this development. Without an experiment, a natural experiment, a discontinuity, or some other strong design, no amount of econometric or statistical modeling can make the move from correlation to causation persuasive.

Annu. Rev. Polit. Sci. 2009. 12:487–508. Downloaded from arjournals.annualreviews.org by University of California Berkeley on 07/08/09. For personal use only.

INTRODUCTION

Although the quantitative turn in the search for causal inferences is more than a century old in the social sciences, in recent years there has been a renewed interest in the problems associated with making causal inferences using such methods. These recent developments highlight tensions in the quantitative tradition that have been present from the beginning. There are a number of conflicting approaches, which overlap but have important distinctions. I focus here on three of them: the experimental, the model-based, and the design-based.
The first is the use of randomized experiments, which in political science may go back to Gosnell (1927).[1] Whether Gosnell randomized or not, Eldersveld (1956) certainly did when he conducted a randomized field experiment to study the effectiveness of canvassing by mail, telephone, and house-to-house visits on voter mobilization. But even with randomization, there is ample disagreement and confusion about exactly how such data should be analyzed—for example, is adjustment by multivariate regression unbiased? There are also concerns about external validity and whether experiments can be used to answer “interesting” or “important” questions. This latter concern appears to be common among social scientists and is sometimes harshly put. One early and suspicious reviewer of experimental methods in the social sciences recalled the words of Horace: “Parturiunt montes, nascetur ridiculus mus” (Mueller 1945).[2] For observational data analysis, however, the disagreements are sharper.

[1] Gosnell may not have actually used randomization (Green & Gerber 2002). His 1924 get-out-the-vote experiment, described in his 1927 book, was conducted one year before Fisher’s 1925 book and 11 years before Fisher’s famous 1935 book on experimental design. Therefore, unsurprisingly, Gosnell’s terminology is nonstandard and leads to some uncertainty about exactly what was done. A definitive answer requires a close examination of Gosnell’s papers at the University of Chicago.

[2] “The mountains are in labor, a ridiculous mouse will be brought forth,” from Horace’s Epistles, Book II, Ars Poetica (The Art of Poetry). Horace is observing that some poets make great promises that result in little.

By far the dominant method of making causal inferences in the quantitative social sciences is model-based, and the most popular model is multivariate regression.
This tradition is also surprisingly old; the first use of regression to estimate treatment effects (as opposed to simply fitting a line through data) was Yule’s (1899) investigation into the causes of changes in pauperism in England. By that time the understanding of regression had evolved from what Stigler (1990) calls the Gauss-Laplace synthesis.

The third tradition focuses on design. Examples abound, but they can be broadly categorized as natural experiments or regression-discontinuity (RD) designs. They share in common an assumption that found data, not part of an actual field experiment, have some “as if random” component: that the assignment to treatment can be regarded as if it were random, or can be so treated after some covariate adjustment. From the beginning, some natural experiments were analyzed as if they were actual experiments (e.g., difference of means), others by matching methods (e.g., Chapin 1938), and yet others—many, many others—by instrumental variables (e.g., Yule 1899). [For an interesting note on who invented instrumental variable regression, see Stock & Trebbi (2003).] A central criticism of natural experiments is that they are not randomized experiments. In most cases, the “as if random” assumption is implausible (for reviews see Dunning 2008 and Rosenzweig & Wolpin 2000).

Regression discontinuity was first proposed by Thistlethwaite & Campbell (1960). They proposed RD as an alternative to what they called “ex post facto experiments,” or what we today would call natural experiments analyzed by matching methods. More specifically, they proposed RD as an alternative to matching methods and other “as if” (conditionally) random experiments outlined by Chapin (1938) and Greenwood (1945), where the assignment mechanism is not well understood. In the case of RD, the researcher finds a sharp breakpoint that makes seemingly random distinctions between units that receive treatment and those that do not.

Where does matching fit in? As we shall see, it depends on how it is used. One of the innovative intellectual developments over the past few years has been to unify all of these methods into a common mathematical and conceptual language, that of the Neyman-Rubin model (Neyman 1990 [1923], Rubin 1974). Although randomized experiments and matching estimators have long been tied to the model, recently instrumental variables (Angrist et al. 1996) and RD (Lee 2008) have also been so tied. This leads to an interesting unity of thought that makes clear that the Neyman-Rubin model is the core of the causal enterprise, and that the various methods and estimators consistent with it, although practically important, are of secondary interest. These are fighting words, because all of these techniques, particularly the clearly algorithmic ones such as matching, can be used without any ties to the Neyman-Rubin model or causality. In such cases, matching becomes nothing more than a nonparametric estimator, a method to be considered alongside CART (Breiman et al. 1984), BART (Chipman et al. 2006), kernel estimation, and a host of others. Matching becomes simply a way to lessen model dependence, not a method for estimating causal effects per se. For causal inference, issues of design are of utmost importance; a lot more is needed than just an algorithm. Like other methods, matching algorithms can always be used, and they usually are, even when design issues are ignored in order to obtain a nonparametric estimate from the data. Of course, in such cases, what exactly has been estimated is unclear.

The Neyman-Rubin model has radical implications for work in the social sciences given current practices.
According to this framework, much of the quantitative work that claims to be causal is not well posed. The questions asked are too vague, and the design is hopelessly compromised by, for example, conditioning on posttreatment variables (Cox 1958, Section 4.2; Rosenbaum 2002, pp. 73–74). The radical import of the Neyman-Rubin model may be highlighted by using it to determine how regression estimators behave when fitted to data from randomized experiments. Randomization does not justify the regression assumptions (Freedman 2008b,c). Without additional assumptions, multiple regression is not unbiased. The variance estimates from multiple regression may be arbitrarily too large or too small, even asymptotically. And for logistic regression, matters only become worse (Freedman 2008d). These are fearful conclusions. These pathologies occur even with randomization, which is supposed to be the easy case.

Although the Neyman-Rubin model is currently the most prominent, and I focus on it in this review, there have obviously been many other attempts to understand causal inference (reviewed by Brady 2008). An alternative whose prominence has been growing in recent years is Pearl’s (2000) work on nonparametric structural equations models (for a critique see Freedman 2004). Pearl’s approach is a modern reincarnation of an old enterprise that has a rich history, including foundational work on causality in systems of structural equations by the political scientist Herbert Simon (1953). Haavelmo (1943) was the first to precisely examine issues of causality in the context of linear structural equations with random errors.

As for matching itself, there is no consensus on how exactly matching ought to be done, how to measure the success of the matching procedure, and whether or not matching estimators are sufficiently robust to misspecification so as to be useful in practice (Heckman et al. 1998).
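Freedman's contrast between the simple difference in means and regression adjustment can be made concrete with a small enumeration sketch. All numbers here are hypothetical, and the helper names (`resid`, `adjusted`) are invented for illustration: over every complete-randomization assignment in a tiny finite population, the difference in means averages exactly to the average treatment effect, whereas the covariate-adjusted OLS coefficient (computed via Frisch-Waugh residualization) carries no such exact guarantee.

```python
import itertools
from statistics import mean

# Hypothetical finite population: fixed potential outcomes and one covariate.
Y1 = [8.0, 3.0, 6.0, 9.0, 4.0, 7.0]   # outcome if treated
Y0 = [5.0, 2.0, 4.0, 4.0, 1.0, 6.0]   # outcome if untreated
x = [1.0, 2.0, 3.0, 5.0, 7.0, 11.0]   # pretreatment covariate
n, m = len(Y1), 3                     # complete randomization: m of n treated
ate = mean(y1 - y0 for y1, y0 in zip(Y1, Y0))

def resid(v, z):
    """Residuals of v after a simple regression on an intercept and z."""
    zb, vb = mean(z), mean(v)
    b = (sum((zi - zb) * (vi - vb) for zi, vi in zip(z, v))
         / sum((zi - zb) ** 2 for zi in z))
    a = vb - b * zb
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]

def adjusted(Y, T):
    """Coefficient on T in OLS of Y on [1, T, x], via Frisch-Waugh."""
    Tr, Yr = resid(T, x), resid(Y, x)
    return sum(t * y for t, y in zip(Tr, Yr)) / sum(t * t for t in Tr)

dim_avg, adj_avg = [], []
for treated in itertools.combinations(range(n), m):
    T = [1.0 if i in treated else 0.0 for i in range(n)]
    Y = [t * y1 + (1 - t) * y0 for t, y1, y0 in zip(T, Y1, Y0)]
    tr = [y for y, t in zip(Y, T) if t == 1]
    co = [y for y, t in zip(Y, T) if t == 0]
    dim_avg.append(mean(tr) - mean(co))
    adj_avg.append(adjusted(Y, T))

print("ATE:", ate)
print("avg difference in means:", mean(dim_avg))  # equals the ATE exactly
print("avg adjusted estimate:", mean(adj_avg))    # need not match exactly
```

The exact unbiasedness of the difference in means is a finite-population fact about complete randomization; the adjusted estimator's average depends on the particular potential outcomes and covariate chosen, which is the sense in which randomization alone does not justify the regression assumptions.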
To illuminate issues of general interest, I review a prominent exchange in the political science literature involving a set of get-out-the-vote (GOTV) field experiments and the use of matching estimators (Arceneaux et al. 2006; Gerber & Green 2000, 2005; Hansen & Bowers 2009; Imai 2005). The matching literature is growing rapidly, so it is impossible to summarize it in a brief review. I focus on design issues more than the technical details of exactly how matching should be done, although the basics are reviewed. Imbens & Wooldridge (2008) have provided an excellent review of recent developments in methods for program evaluation. For additional reviews of the matching literature, see Morgan & Harding (2006), Morgan & Winship (2007), Rosenbaum (2005), and Rubin (2006).

THE NEYMAN-RUBIN CAUSAL MODEL

The Neyman-Rubin framework has become increasingly popular in many fields, including statistics (Holland 1986; Rosenbaum 2002; Rubin 1974, 2006), medicine (Christakis & Iwashyna 2003, Rubin 1997), economics (Abadie & Imbens 2006a; Dehejia & Wahba 2002, 1999; Galiani et al. 2005), political science (Bowers & Hansen 2005, Imai 2005, Sekhon 2004), sociology (Diprete & Engelhardt 2004, Morgan & Harding 2006, Smith 1997, Winship & Morgan 1999), and even law (Rubin 2001). The framework originated with Neyman’s (1990 [1923]) model, which is nonparametric for a finite number of treatments where each unit has two potential outcomes for each treatment—one if the unit is treated and the other if untreated. A causal effect is defined as the difference between the two potential outcomes, but only one of the two potential outcomes is observed.
Rubin (1974, 2006) developed the model into a general framework for causal inference with implications for observational research. Holland (1986) wrote an influential review article that highlighted some of the philosophical implications of the framework. Consequently, instead of the “Neyman-Rubin model,” the model is often simply called the Rubin causal model (e.g., Holland 1986) or sometimes the Neyman-Rubin-Holland model (e.g., Brady 2008) or the Neyman-Holland-Rubin model (e.g., Freedman 2006).

The intellectual history of the Neyman-Rubin model is the subject of some controversy (e.g., Freedman 2006, Rubin 1990, Speed 1990). Neyman’s 1923 article never mentions the random assignment of treatments. Instead, the original motivation was an urn model, and the explicit suggestion to use the urn model to physically assign treatments is absent from the paper (Speed 1990). An urn model is based on an idealized thought experiment in which colored balls are drawn randomly from an urn. Using the model does not imply that treatment should be physically assigned in a random fashion. It was left to R.A. Fisher in the 1920s and 1930s to note the importance of the physical act of randomization in experiments. Fisher first did this in the context of experimental design in his 1925 book, expanded on the issue in a 1926 article for agricultural researchers, and developed it more fully and for a broader audience in his 1935 book The Design of Experiments [for more on Fisher’s role in the advocacy of randomization see Armitage (2003), Hall (2007), Preece (1990)]. As Reid (1982, p. 45) notes of Neyman: “On one occasion, when someone perceived him as anticipating the English statistician R.A. Fisher in the use of randomization, he objected strenuously: ‘I treated theoretically an unrestrictedly randomized agricultural experiment and the randomization was considered as a prerequisite to probabilistic treatment of the results.
This is not the same as the recognition that without randomization an experiment has little value irrespective of the subsequent treatment. The latter point is due to Fisher, and I consider it as one of the most valuable of Fisher’s achievements.’”[3]

[3] Also see Rubin (1990, p. 477).

This gap between Neyman and Fisher points to the fact that there was something absent from the Neyman mathematical formulation in 1923, which was added later, even though the symbolic formulation was complete in 1923. What those symbols meant changed. And in these changes lies what is causal about the Neyman-Rubin model—i.e., a focus on the mechanism by which treatment is assigned.

The Neyman-Rubin model is more than just the math of the original Neyman model. Obviously, it does not rely on an urn-model motivation for the observed potential outcomes, but instead, for experiments, a motivation based on the random assignment of treatment. And for observational studies, one relies on the assumption that the assignment of treatment can be treated as if it were random. In either case, the mechanism by which treatment is assigned is of central importance. And the realization that the primacy of the assignment mechanism holds true for observational data no less than for experimental is due to Rubin (1974). This insight has been turned into a motto: “No causation without manipulation” (Holland 1986).

Although the original article was written in Polish, Neyman’s work was known in the English-speaking world (Reid 1982), and in 1938 Neyman moved from Poland to Berkeley.
It is thus unsurprising that the Neyman model quickly became the standard way of describing potential outcomes of randomized experiments (e.g., Anscombe 1948; Kempthorne 1952, 1955; McCarthy 1939; Pitman 1937; Welch 1937). The most complete discussion I know of before Rubin’s work is Scheffé (1956). And a simplified version of the model even appears in an introductory textbook in the 1960s (Hodges & Lehmann 1964, sec. 9.4).[4]

[4] The philosopher David Lewis (1973) is often cited for hypothetical counterfactuals and causality, and it is sometimes noted that he predated, by a year, Rubin (1974). The Neyman model predates Lewis.

The basic setup of the Neyman model is very simple. Let Y_i1 denote the potential outcome for unit i if the unit receives treatment, and let Y_i0 denote the potential outcome for unit i in the control regime. The treatment effect for observation i is defined by τ_i = Y_i1 − Y_i0. Causal inference is a missing data problem because Y_i1 and Y_i0 are never both observed. This remains true regardless of the methodology used to make inferential progress—regardless of whether we use quantitative or qualitative methods of inference. The fact remains that we cannot observe both potential outcomes at the same time. Some assumptions have to be made to make progress. The most compelling are offered by a randomized experiment. Let T_i be a treatment indicator: 1 when i is in the treatment regime and 0 otherwise. The observed outcome for observation i is then:

    Y_i = T_i Y_i1 + (1 − T_i) Y_i0.    (1)

Note that in contrast to the usual regression assumptions, the potential outcomes, Y_i0 and Y_i1, are fixed quantities and not random variables, and that Y_i is only random because of treatment assignment. Extensions to the case of multiple discrete treatments are straightforward (e.g., Imbens 2000; Rosenbaum 2002, pp. 300–302). Extensions to the continuous case are possible but lose the nonparametric nature of the Neyman model (see Imai & van Dyk 2004).
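The bookkeeping of the setup above can be sketched in a few lines of code. The potential-outcome numbers are hypothetical; the point is that only one of Y_i1 and Y_i0 is revealed per unit, the observed outcome follows Equation 1 exactly, and the unit-level effects τ_i remain unobservable even though the difference in means estimates their average.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical potential outcomes for six units (never both observed in practice).
Y1 = [7.0, 3.0, 5.0, 9.0, 4.0, 6.0]   # Y_i1: outcome under treatment
Y0 = [5.0, 2.0, 4.0, 4.0, 1.0, 6.0]   # Y_i0: outcome under control
tau = [y1 - y0 for y1, y0 in zip(Y1, Y0)]   # unit-level effects tau_i
n = len(Y1)

# Complete randomization: half the units are assigned to treatment.
treated = set(random.sample(range(n), n // 2))
T = [1 if i in treated else 0 for i in range(n)]

# Equation (1): Y_i = T_i * Y_i1 + (1 - T_i) * Y_i0.
Y = [t * y1 + (1 - t) * y0 for t, y1, y0 in zip(T, Y1, Y0)]

# What the analyst actually sees: one potential outcome per unit, the other missing.
observed_table = [(t, y1 if t else None, y0 if not t else None)
                  for t, y1, y0 in zip(T, Y1, Y0)]

# The simple difference in means estimates the average of the unobservable tau_i.
est = (mean(y for y, t in zip(Y, T) if t == 1)
       - mean(y for y, t in zip(Y, T) if t == 0))
print("true ATE:", mean(tau), "| one-draw estimate:", est)
```

A single draw will generally not equal the true average effect; it is the distribution of the estimator over all possible assignments that randomization makes tractable.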
